Predicting Online News Article Popularity
- Introduction
- Dataset
- Importing the data
- Explore and Clean the dataset
- Exploration: Distributions of features
- Cleaning: Handle Categorical Data
- Cleaning: Handle Missing Data
- Transforming: Handle Skewness in Distributions
- Exploration: Number of shares by the weekday
- Exploration: Number of shares by publish tags
- Cleaning: Remove unwanted Columns
- Transformation: Generate binary response variable
- Modeling
Introduction
The aim of this project is to explore a dataset in depth, apply a business-analytics mindset to implement appropriate predictive analytics, and communicate the findings effectively.
The dataset comprises statistical measures on online news articles. This analysis builds a machine learning system on the dataset to predict the popularity of online news articles, with the goal of using the system to configure and present future articles so that they sell more advertisement.
Dataset
The dataset comes from the UCI Machine Learning Repository: Online News Popularity Data Set. It summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
Data Description
Attribute Information:
0. url: URL of the article
1. timedelta: Days between the article publication and the dataset acquisition
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. weekday_is_monday: Was the article published on a Monday?
32. weekday_is_tuesday: Was the article published on a Tuesday?
33. weekday_is_wednesday: Was the article published on a Wednesday?
34. weekday_is_thursday: Was the article published on a Thursday?
35. weekday_is_friday: Was the article published on a Friday?
36. weekday_is_saturday: Was the article published on a Saturday?
37. weekday_is_sunday: Was the article published on a Sunday?
38. is_weekend: Was the article published on the weekend?
39. LDA_00: Closeness to LDA topic 0
40. LDA_01: Closeness to LDA topic 1
41. LDA_02: Closeness to LDA topic 2
42. LDA_03: Closeness to LDA topic 3
43. LDA_04: Closeness to LDA topic 4
44. global_subjectivity: Text subjectivity
45. global_sentiment_polarity: Text sentiment polarity
46. global_rate_positive_words: Rate of positive words in the content
47. global_rate_negative_words: Rate of negative words in the content
48. rate_positive_words: Rate of positive words among non-neutral tokens
49. rate_negative_words: Rate of negative words among non-neutral tokens
50. avg_positive_polarity: Avg. polarity of positive words
51. min_positive_polarity: Min. polarity of positive words
52. max_positive_polarity: Max. polarity of positive words
53. avg_negative_polarity: Avg. polarity of negative words
54. min_negative_polarity: Min. polarity of negative words
55. max_negative_polarity: Max. polarity of negative words
56. title_subjectivity: Title subjectivity
57. title_sentiment_polarity: Title polarity
58. abs_title_subjectivity: Absolute subjectivity level
59. abs_title_sentiment_polarity: Absolute polarity level
60. shares: Number of shares (target)
Importing the data
The data is imported into R and comprises 61 features. The 61st feature, shares, is the target variable: an article is considered good enough to publish, or to sell advertisement, if its number of shares exceeds 1400.
## [1] "url" "timedelta"
## [3] "n_tokens_title" "n_tokens_content"
## [5] "n_unique_tokens" "n_non_stop_words"
## [7] "n_non_stop_unique_tokens" "num_hrefs"
## [9] "num_self_hrefs" "num_imgs"
## [11] "num_videos" "average_token_length"
## [13] "num_keywords" "data_channel_is_lifestyle"
## [15] "data_channel_is_entertainment" "data_channel_is_bus"
## [17] "data_channel_is_socmed" "data_channel_is_tech"
## [19] "data_channel_is_world" "kw_min_min"
## [21] "kw_max_min" "kw_avg_min"
## [23] "kw_min_max" "kw_max_max"
## [25] "kw_avg_max" "kw_min_avg"
## [27] "kw_max_avg" "kw_avg_avg"
## [29] "self_reference_min_shares" "self_reference_max_shares"
## [31] "self_reference_avg_sharess" "weekday_is_monday"
## [33] "weekday_is_tuesday" "weekday_is_wednesday"
## [35] "weekday_is_thursday" "weekday_is_friday"
## [37] "weekday_is_saturday" "weekday_is_sunday"
## [39] "is_weekend" "LDA_00"
## [41] "LDA_01" "LDA_02"
## [43] "LDA_03" "LDA_04"
## [45] "global_subjectivity" "global_sentiment_polarity"
## [47] "global_rate_positive_words" "global_rate_negative_words"
## [49] "rate_positive_words" "rate_negative_words"
## [51] "avg_positive_polarity" "min_positive_polarity"
## [53] "max_positive_polarity" "avg_negative_polarity"
## [55] "min_negative_polarity" "max_negative_polarity"
## [57] "title_subjectivity" "title_sentiment_polarity"
## [59] "abs_title_subjectivity" "abs_title_sentiment_polarity"
## [61] "shares"
Explore and Clean the dataset
To find which of the remaining 60 features are best for predicting shares, we inspect the data and look for patterns through summaries and plots.
Exploration: Distributions of features
Let's check the distribution of values that each column takes. To do so we filter down to the columns with numeric values only; as the result of str(news) below shows, all columns except url are numeric.
## 'data.frame': 39644 obs. of 61 variables:
## $ url : chr "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
## $ timedelta : num 731 731 731 731 731 731 731 731 731 731 ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_words : num 1 1 1 1 1 ...
## $ n_non_stop_unique_tokens : num 0.815 0.792 0.664 0.666 0.541 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_self_hrefs : num 2 1 1 0 19 2 20 20 0 1 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ data_channel_is_lifestyle : num 0 0 0 0 0 0 1 0 0 0 ...
## $ data_channel_is_entertainment: num 1 0 0 1 0 0 0 0 0 0 ...
## $ data_channel_is_bus : num 0 1 1 0 0 0 0 0 0 0 ...
## $ data_channel_is_socmed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ data_channel_is_tech : num 0 0 0 0 1 1 0 1 1 0 ...
## $ data_channel_is_world : num 0 0 0 0 0 0 0 0 0 1 ...
## $ kw_min_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_min : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_min_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_max_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ kw_avg_avg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ self_reference_min_shares : num 496 0 918 0 545 8500 545 545 0 0 ...
## $ self_reference_max_shares : num 496 0 918 0 16000 8500 16000 16000 0 0 ...
## $ self_reference_avg_sharess : num 496 0 918 0 3151 ...
## $ weekday_is_monday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_tuesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_wednesday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_thursday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_friday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_saturday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday_is_sunday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ is_weekend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LDA_00 : num 0.5003 0.7998 0.2178 0.0286 0.0286 ...
## $ LDA_01 : num 0.3783 0.05 0.0333 0.4193 0.0288 ...
## $ LDA_02 : num 0.04 0.0501 0.0334 0.4947 0.0286 ...
## $ LDA_03 : num 0.0413 0.0501 0.0333 0.0289 0.0286 ...
## $ LDA_04 : num 0.0401 0.05 0.6822 0.0286 0.8854 ...
## $ global_subjectivity : num 0.522 0.341 0.702 0.43 0.514 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ global_rate_positive_words : num 0.0457 0.0431 0.0569 0.0414 0.0746 ...
## $ global_rate_negative_words : num 0.0137 0.01569 0.00948 0.02072 0.01213 ...
## $ rate_positive_words : num 0.769 0.733 0.857 0.667 0.86 ...
## $ rate_negative_words : num 0.231 0.267 0.143 0.333 0.14 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ min_positive_polarity : num 0.1 0.0333 0.1 0.1364 0.0333 ...
## $ max_positive_polarity : num 0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
## $ avg_negative_polarity : num -0.35 -0.119 -0.467 -0.37 -0.22 ...
## $ min_negative_polarity : num -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
## $ max_negative_polarity : num -0.2 -0.1 -0.133 -0.167 -0.05 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity : num 0.188 0 0 0 0.136 ...
## $ shares : int 593 711 1500 1200 505 855 556 891 3600 710 ...
With url removed, we also check whether there are any NULLs or NAs in the data; it turns out there are none.
## [1] FALSE
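The check itself is not shown in the report; a minimal sketch consistent with the FALSE result above (dropping the non-numeric url column first) could be:

```r
# FALSE means there are no NA values anywhere in the numeric part of the data
any(is.na(news[, -1]))
```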
par(mfrow= c(3,4))
for(i in 2:length(news)){
hist(news[,i],main = names(news)[i] ,xlab=names(news)[i])
}
Cleaning: Handle Categorical Data
Looking at the distributions of all the columns above, we can conclude that columns 14-19 and 32-39 (see the names below) contain binary data, so it is better to convert them to factors.
## [1] "data_channel_is_lifestyle" "data_channel_is_entertainment"
## [3] "data_channel_is_bus" "data_channel_is_socmed"
## [5] "data_channel_is_tech" "data_channel_is_world"
## [7] "weekday_is_monday" "weekday_is_tuesday"
## [9] "weekday_is_wednesday" "weekday_is_thursday"
## [11] "weekday_is_friday" "weekday_is_saturday"
## [13] "weekday_is_sunday" "is_weekend"
Converting the columns mentioned above to factors.
for (i in names(news)[c(14:19,32:39)]){
news[,i]<-factor(news[,i])
}
str(news[,names(news)[c(14:19,32:39)]])
## 'data.frame': 39644 obs. of 14 variables:
## $ data_channel_is_lifestyle : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
## $ data_channel_is_bus : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
## $ data_channel_is_socmed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ data_channel_is_tech : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
## $ data_channel_is_world : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ weekday_is_monday : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ weekday_is_tuesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_wednesday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_thursday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_friday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_saturday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday_is_sunday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ is_weekend : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Cleaning: Handle Missing Data
Missing values are coded as 0 in this dataset. Setting aside the binary data columns, around 3% of the rows contain missing data that needs to be cleaned first.
Columns with missing data:
## [1] "num_videos" "kw_min_min"
## [3] "LDA_04" "global_subjectivity"
## [5] "global_sentiment_polarity" "global_rate_negative_words"
## [7] "rate_positive_words" "rate_negative_words"
## [9] "max_positive_polarity"
Total records vs no. of unclean records
# Cumulatively drop rows where any of the affected columns is zero
news_clean <- news
for(i in c(11,20,44,45,46,48,49,50,53)) news_clean <- news_clean[news_clean[,i]!=0,]
print(paste0("Total Records: ",nrow(news),
" Unclean Records: ", nrow(news)-nrow(news_clean),
" i.e. ", 100*(nrow(news)-nrow(news_clean))/nrow(news), ' %' ))
## [1] "Total Records: 39644 Unclean Records: 1217 i.e. 3.06982141055393 %"
Since the unclean rows amount to roughly 3% of the data, we can simply omit them.
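The corresponding cleanup step is not shown; a one-line sketch would keep only the clean subset for the rest of the analysis:

```r
# Continue the analysis with the cleaned data only
news <- news_clean
```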
Transforming: Handle Skewness in Distributions
Some of the variables, including the response shares, have heavily right-skewed distributions, so we transform them to reduce the skewness: variables whose values are all greater than 0 are log-transformed, while variables containing zeros are square-root-transformed. The response shares itself is left untransformed here.
Columns undergoing transformation:
## [1] "n_tokens_title" "n_non_stop_unique_tokens"
## [3] "num_hrefs" "num_self_hrefs"
## [5] "num_imgs" "kw_max_avg"
## [7] "kw_avg_avg" "self_reference_min_shares"
## [9] "self_reference_max_shares" "LDA_00"
## [11] "LDA_01" "LDA_02"
## [13] "LDA_03" "global_rate_positive_words"
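A sketch of the transformation rule described above, assuming a hypothetical helper vector skewed holding the column names just listed:

```r
# Log-transform strictly positive columns; square-root-transform columns
# that contain zeros (log(0) is undefined)
for (col in skewed) {
  if (min(news[, col]) > 0) {
    news[, col] <- log(news[, col])
  } else {
    news[, col] <- sqrt(news[, col])
  }
}
```

This is consistent with the tree output later in the report, where LDA columns take negative split points (e.g. LDA_00 <= -2.99), as expected after a log transform of values in (0, 1).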
Cleaning: Remove unwanted Columns
Removing the datachannel column created for the plot above.
Also deleting the url and timedelta columns.
Deleting the column n_non_stop_words since it takes only one value and is therefore a constant.
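A sketch of these removals (the name datachannel for the plotting helper is assumed from the text):

```r
# Drop helper and uninformative columns
news$datachannel       <- NULL  # plotting helper created earlier (assumed name)
news$url               <- NULL  # identifier, not predictive
news$timedelta         <- NULL  # acquisition artifact
news$n_non_stop_words  <- NULL  # constant: a single value carries no information
```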
Transformation: Generate binary response variable
Define articles with shares larger than 1400 (the median) as popular articles.
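A one-line sketch of this transformation, assuming the 0/1 factor coding used by the models below:

```r
# Recode the target: 1 = popular (above the median of 1400), 0 = not popular
news$shares <- factor(ifelse(news$shares > 1400, 1, 0))
```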
Modeling
Since our target is a class (1 for a popular article, 0 for a non-popular article), we will apply classification methods: LDA, QDA, logistic regression, k-NN, Classification and Regression Trees (CART), C5.0 trees, and random forests, training on a training set and predicting on a test set. We first generate these sets.
Create Training and Test sets
Split the data into 70% train and 30% test, and set a color palette for each method.
# Set random seed for reproducibility
set.seed(100)
# Randomly assign each row to training (1) or test (2)
ind<-sample(2,nrow(news),replace=TRUE,prob=c(0.7,0.3))
color.knn<-'#efab69'
color.lda<-'#ab69ef'
color.qda<-'#69adef'
color.lr<-'#adef69'
color.cart<-"#72E2FF"
color.c50<-"#67B5DA"
color.rf<-"#68B518"
print(paste0('#Train: ', table(ind)[1],' #Test: ', table(ind)[2]))
## [1] "#Train: 27046 #Test: 11381"
Check for collinearity
We need to check whether any numerical columns are collinear with each other before applying our algorithms.
corDF = cor(news[ind==1,names(dplyr::select_if(news, is.numeric))]);
dissimilarity <- 1 - abs(corDF);
distance <- as.dist(dissimilarity);
hc <- hclust(distance);
clusterV = cutree(hc,h=0.05);
df<-as.data.frame(clusterV)
df$columns<-rownames(df)
knitr::kable(df[(df$clusterV==32),])
|  | clusterV | columns |
|---|---|---|
| rate_positive_words | 32 | rate_positive_words |
| rate_negative_words | 32 | rate_negative_words |
From cluster 32 we can see that rate_positive_words and rate_negative_words are collinear (among non-neutral tokens the two rates sum to one). Hence, we remove rate_negative_words.
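A minimal sketch of the removal:

```r
# rate_positive_words + rate_negative_words = 1, so one of them is redundant
news$rate_negative_words <- NULL
```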
LDA
Train the model on train set
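The training call is not shown; a sketch assuming the caret package (consistent with the xNames/tuneValue fields in the model summary below) would be:

```r
library(caret)

# Train an LDA classifier on the training partition
news.lda <- train(shares ~ ., data = news[ind == 1, ], method = "lda")
summary(news.lda$finalModel)
```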
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 98 -none- numeric
## scaling 49 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 49 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Predict on test set
news.lda.pred <- predict( news.lda,news[ind==2,])
news.lda.prob <- predict(news.lda,news[ind==2,],type='prob')
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3928 2182
## 1 1798 3473
##
## Accuracy : 0.6503
## 95% CI : (0.6415, 0.6591)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3003
## Mcnemar's Test P-Value : 1.271e-09
##
## Sensitivity : 0.6860
## Specificity : 0.6141
## Pos Pred Value : 0.6429
## Neg Pred Value : 0.6589
## Prevalence : 0.5031
## Detection Rate : 0.3451
## Detection Prevalence : 0.5369
## Balanced Accuracy : 0.6501
##
## 'Positive' Class : 0
##
ROC Curve
news.lda.roc <- roc(news[ind==2,]$shares,news.lda.prob[,2])
plot(news.lda.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col=color.lda, print.thres=TRUE)
QDA
Train the model on train set
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 98 -none- numeric
## scaling 4802 -none- numeric
## ldet 2 -none- numeric
## lev 2 -none- character
## N 1 -none- numeric
## call 3 -none- call
## xNames 49 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Predict on test set
news.qda.pred <- predict( news.qda,news[ind==2,])
news.qda.prob <- predict(news.qda,news[ind==2,],type='prob')
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4157 2582
## 1 1569 3073
##
## Accuracy : 0.6353
## 95% CI : (0.6263, 0.6441)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2697
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7260
## Specificity : 0.5434
## Pos Pred Value : 0.6169
## Neg Pred Value : 0.6620
## Prevalence : 0.5031
## Detection Rate : 0.3653
## Detection Prevalence : 0.5921
## Balanced Accuracy : 0.6347
##
## 'Positive' Class : 0
##
ROC Curve
news.qda.roc <- roc(news[ind==2,]$shares,news.qda.prob[,2])
plot(news.qda.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col=color.qda, print.thres=TRUE)
Logistic Regression
Train the model on train set
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3201 -1.0224 -0.6162 1.0655 2.6180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.540e+00 6.452e-01 -5.487 4.08e-08 ***
## n_tokens_title -1.391e-02 6.447e-02 -0.216 0.829129
## n_tokens_content 1.708e-04 4.953e-05 3.448 0.000564 ***
## n_unique_tokens -5.583e-02 3.746e-01 -0.149 0.881539
## n_non_stop_unique_tokens -3.908e-01 2.039e-01 -1.917 0.055262 .
## num_hrefs 7.959e-02 1.306e-02 6.093 1.11e-09 ***
## num_self_hrefs -1.714e-01 1.884e-02 -9.099 < 2e-16 ***
## num_imgs 3.671e-03 1.269e-02 0.289 0.772317
## num_videos -5.626e-03 3.581e-03 -1.571 0.116138
## average_token_length -1.049e-01 5.428e-02 -1.934 0.053164 .
## num_keywords 4.150e-02 8.842e-03 4.693 2.69e-06 ***
## data_channel_is_lifestyle1 -1.298e-01 8.648e-02 -1.501 0.133248
## data_channel_is_entertainment1 -3.745e-01 5.705e-02 -6.564 5.24e-11 ***
## data_channel_is_bus1 -7.764e-02 7.785e-02 -0.997 0.318602
## data_channel_is_socmed1 7.023e-01 8.363e-02 8.399 < 2e-16 ***
## data_channel_is_tech1 4.876e-01 8.035e-02 6.069 1.29e-09 ***
## data_channel_is_world1 -1.048e-01 7.728e-02 -1.357 0.174888
## kw_min_min 1.410e-03 3.719e-04 3.790 0.000150 ***
## kw_max_min 5.513e-06 1.082e-05 0.510 0.610326
## kw_avg_min -1.322e-04 7.115e-05 -1.858 0.063146 .
## kw_min_max -6.586e-07 2.613e-07 -2.520 0.011731 *
## kw_max_max -5.950e-07 1.340e-07 -4.440 9.02e-06 ***
## kw_avg_max -6.041e-07 1.887e-07 -3.202 0.001366 **
## kw_min_avg -5.739e-05 1.797e-05 -3.194 0.001403 **
## kw_max_avg -1.278e-02 1.590e-03 -8.038 9.11e-16 ***
## kw_avg_avg 7.115e-02 4.357e-03 16.332 < 2e-16 ***
## self_reference_min_shares 5.011e-03 4.125e-04 12.150 < 2e-16 ***
## self_reference_max_shares 3.102e-03 3.223e-04 9.624 < 2e-16 ***
## self_reference_avg_sharess -8.821e-06 9.027e-07 -9.772 < 2e-16 ***
## is_weekend1 8.444e-01 4.094e-02 20.626 < 2e-16 ***
## LDA_00 1.390e-01 1.843e-02 7.544 4.57e-14 ***
## LDA_01 -2.372e-02 1.594e-02 -1.489 0.136602
## LDA_02 -3.471e-02 1.771e-02 -1.960 0.049951 *
## LDA_03 4.222e-03 1.756e-02 0.241 0.809930
## LDA_04 2.349e-01 1.023e-01 2.297 0.021633 *
## global_subjectivity 1.089e+00 1.919e-01 5.677 1.37e-08 ***
## global_sentiment_polarity -7.936e-02 3.737e-01 -0.212 0.831829
## global_rate_positive_words -4.160e-02 7.552e-02 -0.551 0.581741
## global_rate_negative_words 3.547e+00 3.745e+00 0.947 0.343546
## rate_positive_words 4.127e-01 3.127e-01 1.320 0.186852
## avg_positive_polarity -4.401e-01 3.074e-01 -1.432 0.152217
## min_positive_polarity -4.086e-01 2.552e-01 -1.601 0.109382
## max_positive_polarity 9.209e-03 9.681e-02 0.095 0.924210
## avg_negative_polarity -4.970e-02 2.821e-01 -0.176 0.860152
## min_negative_polarity 7.079e-02 1.036e-01 0.683 0.494327
## max_negative_polarity 1.395e-01 2.337e-01 0.597 0.550546
## title_subjectivity 9.030e-02 6.177e-02 1.462 0.143765
## title_sentiment_polarity 1.567e-01 5.726e-02 2.736 0.006220 **
## abs_title_subjectivity 2.970e-01 8.260e-02 3.596 0.000324 ***
## abs_title_sentiment_polarity 1.814e-02 9.001e-02 0.201 0.840309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37482 on 27045 degrees of freedom
## Residual deviance: 33722 on 26996 degrees of freedom
## AIC: 33822
##
## Number of Fisher Scoring iterations: 4
Predict on test set
news.lr.pred <- predict( news.lr,news[ind==2,])
news.lr.prob <- predict(news.lr,news[ind==2,],type='prob')
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3926 2172
## 1 1800 3483
##
## Accuracy : 0.651
## 95% CI : (0.6422, 0.6598)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3017
## Mcnemar's Test P-Value : 3.941e-09
##
## Sensitivity : 0.6856
## Specificity : 0.6159
## Pos Pred Value : 0.6438
## Neg Pred Value : 0.6593
## Prevalence : 0.5031
## Detection Rate : 0.3450
## Detection Prevalence : 0.5358
## Balanced Accuracy : 0.6508
##
## 'Positive' Class : 0
##
ROC Curve
news.lr.roc <- roc(news[ind==2,]$shares,news.lr.prob[,2])
plot(news.lr.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col=color.lr, print.thres=TRUE)
KNN
Train the model on train set
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## terms 3 terms call
## xlevels 7 -none- list
## theDots 0 -none- list
Predict on test set
news.knn.pred <- predict( news.knn,news[ind==2,],type='class')
news.knn.prob <- predict(news.knn,news[ind==2,],type='prob')
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3321 2723
## 1 2405 2932
##
## Accuracy : 0.5494
## 95% CI : (0.5402, 0.5586)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.0985
## Mcnemar's Test P-Value : 9.566e-06
##
## Sensitivity : 0.5800
## Specificity : 0.5185
## Pos Pred Value : 0.5495
## Neg Pred Value : 0.5494
## Prevalence : 0.5031
## Detection Rate : 0.2918
## Detection Prevalence : 0.5311
## Balanced Accuracy : 0.5492
##
## 'Positive' Class : 0
##
ROC Curve
news.knn.roc <- roc(news[ind==2,]$shares,news.knn.prob[,2])
plot(news.knn.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col=color.knn, print.thres=TRUE)
CART
Train the model on train set
## Call:
## rpart(formula = shares ~ ., data = news[ind == 1, ], method = "class")
## n= 27046
##
## CP nsplit rel error xerror xstd
## 1 0.18105597 0 1.0000000 1.0000000 0.006209699
## 2 0.02711685 1 0.8189440 0.8202281 0.006089356
## 3 0.01329406 2 0.7918272 0.7923559 0.006052834
## 4 0.01000000 4 0.7652391 0.7681849 0.006017118
##
## Variable importance
## kw_avg_avg kw_max_avg
## 24 15
## kw_min_avg kw_min_max
## 10 9
## LDA_03 kw_avg_max
## 8 8
## data_channel_is_entertainment data_channel_is_tech
## 7 6
## data_channel_is_socmed LDA_04
## 6 4
## LDA_01
## 2
##
## Node number 1: 27046 observations, complexity param=0.181056
## predicted class=0 expected loss=0.4894994 P(node) =1
## class counts: 13807 13239
## probabilities: 0.511 0.489
## left son=2 (13891 obs) right son=3 (13155 obs)
## Primary splits:
## kw_avg_avg < 53.68422 to the left, improve=528.8528, (0 missing)
## kw_max_avg < 61.3408 to the left, improve=448.9506, (0 missing)
## self_reference_min_shares < 40.61553 to the left, improve=418.7020, (0 missing)
## self_reference_avg_sharess < 1896.167 to the left, improve=397.4711, (0 missing)
## LDA_02 < -0.6717228 to the right, improve=343.2252, (0 missing)
## Surrogate splits:
## kw_max_avg < 65.87515 to the left, agree=0.821, adj=0.631, (0 split)
## kw_min_avg < 1692.644 to the left, agree=0.725, adj=0.435, (0 split)
## kw_min_max < 2950 to the left, agree=0.696, adj=0.375, (0 split)
## LDA_03 < -2.995495 to the left, agree=0.685, adj=0.352, (0 split)
## kw_avg_max < 283366.9 to the left, agree=0.672, adj=0.325, (0 split)
##
## Node number 2: 13891 observations, complexity param=0.01329406
## predicted class=0 expected loss=0.3932762 P(node) =0.5136064
## class counts: 8428 5463
## probabilities: 0.607 0.393
## left son=4 (10673 obs) right son=5 (3218 obs)
## Primary splits:
## data_channel_is_tech splits as LR, improve=140.9526, (0 missing)
## self_reference_avg_sharess < 1866.833 to the left, improve=129.0816, (0 missing)
## kw_avg_max < 146818.8 to the right, improve=126.8164, (0 missing)
## is_weekend splits as LR, improve=126.3479, (0 missing)
## self_reference_min_shares < 39.36492 to the left, improve=123.2359, (0 missing)
## Surrogate splits:
## LDA_04 < 0.5105669 to the left, agree=0.898, adj=0.558, (0 split)
## num_self_hrefs < 3.239451 to the left, agree=0.772, adj=0.016, (0 split)
## average_token_length < 4.148773 to the right, agree=0.771, adj=0.013, (0 split)
## n_unique_tokens < 0.3240034 to the right, agree=0.769, adj=0.001, (0 split)
## LDA_03 < -4.003425 to the right, agree=0.769, adj=0.001, (0 split)
##
## Node number 3: 13155 observations, complexity param=0.02711685
## predicted class=1 expected loss=0.408894 P(node) =0.4863936
## class counts: 5379 7776
## probabilities: 0.409 0.591
## left son=6 (2611 obs) right son=7 (10544 obs)
## Primary splits:
## data_channel_is_entertainment splits as RL, improve=166.48210, (0 missing)
## self_reference_min_shares < 40.61553 to the left, improve=141.94850, (0 missing)
## self_reference_avg_sharess < 2974.167 to the left, improve=117.24110, (0 missing)
## is_weekend splits as LR, improve= 93.10397, (0 missing)
## self_reference_max_shares < 55.22495 to the left, improve= 92.74248, (0 missing)
## Surrogate splits:
## LDA_01 < -0.7274334 to the right, agree=0.864, adj=0.316, (0 split)
## num_videos < 21.5 to the right, agree=0.805, adj=0.018, (0 split)
## num_imgs < 7.035534 to the right, agree=0.805, adj=0.015, (0 split)
## n_non_stop_unique_tokens < -1.319281 to the left, agree=0.802, adj=0.003, (0 split)
## average_token_length < 3.803062 to the left, agree=0.802, adj=0.002, (0 split)
##
## Node number 4: 10673 observations, complexity param=0.01329406
## predicted class=0 expected loss=0.3541647 P(node) =0.394624
## class counts: 6893 3780
## probabilities: 0.646 0.354
## left son=8 (10107 obs) right son=9 (566 obs)
## Primary splits:
## data_channel_is_socmed splits as LR, improve=127.07840, (0 missing)
## kw_avg_max < 143976.2 to the right, improve=116.11490, (0 missing)
## kw_max_max < 654150 to the right, improve=107.19990, (0 missing)
## self_reference_min_shares < 40.61553 to the left, improve=102.78630, (0 missing)
## kw_min_min < 122.5 to the left, improve= 95.03958, (0 missing)
## Surrogate splits:
## num_self_hrefs < 6.36384 to the left, agree=0.947, adj=0.009, (0 split)
## num_keywords < 2.5 to the right, agree=0.947, adj=0.009, (0 split)
##
## Node number 5: 3218 observations
## predicted class=1 expected loss=0.4770044 P(node) =0.1189825
## class counts: 1535 1683
## probabilities: 0.477 0.523
##
## Node number 6: 2611 observations
## predicted class=0 expected loss=0.4312524 P(node) =0.09653923
## class counts: 1485 1126
## probabilities: 0.569 0.431
##
## Node number 7: 10544 observations
## predicted class=1 expected loss=0.3693096 P(node) =0.3898543
## class counts: 3894 6650
## probabilities: 0.369 0.631
##
## Node number 8: 10107 observations
## predicted class=0 expected loss=0.3359058 P(node) =0.3736967
## class counts: 6712 3395
## probabilities: 0.664 0.336
##
## Node number 9: 566 observations
## predicted class=1 expected loss=0.319788 P(node) =0.02092731
## class counts: 181 385
## probabilities: 0.320 0.680
Plot tree
Predict on test set
news.cart.pred<-predict( news.cart,news[ind==2,] ,type="class")
news.cart.prob <- predict(news.cart, news[ind==2,], type="prob")
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3414 1983
## 1 2312 3672
##
## Accuracy : 0.6226
## 95% CI : (0.6136, 0.6315)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2455
## Mcnemar's Test P-Value : 5.59e-07
##
## Sensitivity : 0.5962
## Specificity : 0.6493
## Pos Pred Value : 0.6326
## Neg Pred Value : 0.6136
## Prevalence : 0.5031
## Detection Rate : 0.3000
## Detection Prevalence : 0.4742
## Balanced Accuracy : 0.6228
##
## 'Positive' Class : 0
##
ROC Curve
news.cart.roc <- roc(news[ind==2,]$shares,news.cart.prob[,2])
plot(news.cart.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col=color.cart, print.thres=TRUE)
C5.0
Train the model on train set
##
## Call:
## C5.0.formula(formula = shares ~ ., data = news[ind == 1, ], method
## = "class")
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Mar 24 04:14:21 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 27046 cases (50 attributes) from undefined.data
##
## Decision tree:
##
## kw_avg_avg <= 53.68346:
## :...data_channel_is_socmed = 1:
## : :...kw_min_max > 3200: 0 (46/15)
## : : kw_min_max <= 3200:
## : : :...min_positive_polarity <= 0.03333334: 1 (272/53)
## : : min_positive_polarity > 0.03333334:
## : : :...self_reference_min_shares > 60.82763: 1 (49/7)
## : : self_reference_min_shares <= 60.82763:
## : : :...avg_negative_polarity <= -0.4787698: 0 (12/1)
## : : avg_negative_polarity > -0.4787698:
## : : :...LDA_00 <= -2.992677:
## : : :...global_rate_negative_words > 0.02540416: 0 (11)
## : : : global_rate_negative_words <= 0.02540416:
## : : : :...n_tokens_title <= 2.302585: 1 (28/11)
## : : : n_tokens_title > 2.302585: 0 (25/6)
## : : LDA_00 > -2.992677:
## : : :...rate_positive_words > 0.9152542: 1 (16)
## : : rate_positive_words <= 0.9152542:
## : : :...global_sentiment_polarity <= 0.2488831: 1 (98/30)
## : : global_sentiment_polarity > 0.2488831: 0 (9/1)
## : data_channel_is_socmed = 0:
## : :...is_weekend = 1:
## : :...data_channel_is_tech = 1: 1 (364/86)
## : : data_channel_is_tech = 0:
## : : :...kw_min_min <= 129:
## : : :...LDA_00 > -1.733872: 1 (196/64)
## : : : LDA_00 <= -1.733872:
## : : : :...kw_max_max <= 690400:
## : : : :...title_sentiment_polarity <= -0.15: 0 (6)
## : : : : title_sentiment_polarity > -0.15: 1 (52/16)
## : : : kw_max_max > 690400:
## : : : :...num_hrefs > 1.414214: 0 (593/253)
## : : : num_hrefs <= 1.414214:
## : : : :...kw_avg_avg <= 53.26955: 0 (53/7)
## : : : kw_avg_avg > 53.26955: 1 (4)
## : : kw_min_min > 129:
## : : :...abs_title_sentiment_polarity > 0.5125: 1 (19)
## : : abs_title_sentiment_polarity <= 0.5125:
## : : :...data_channel_is_entertainment = 0: 1 (120/35)
## : : data_channel_is_entertainment = 1:
## : : :...LDA_01 <= -3.331762: 0 (9)
## : : LDA_01 > -3.331762:
## : : :...num_videos > 1: 0 (3)
## : : num_videos <= 1:
## : : :...min_positive_polarity > 0.0625: 1 (13)
## : : min_positive_polarity <= 0.0625:
## : : :...num_keywords <= 9: 1 (6/1)
## : : num_keywords > 9: 0 (5)
## : is_weekend = 0:
## : :...data_channel_is_tech = 1:
## : :...n_unique_tokens <= 0.3858625: 1 (237/78)
## : : n_unique_tokens > 0.3858625:
## : : :...kw_min_min > 88: 1 (462/187)
## : : kw_min_min <= 88:
## : : :...n_non_stop_unique_tokens > -0.1306202:
## : : :...num_hrefs <= 1.414214: 1 (5)
## : : : num_hrefs > 1.414214: 0 (64/9)
## : : n_non_stop_unique_tokens <= -0.1306202:
## : : :...self_reference_avg_sharess <= 1788.667:
## : : :...num_keywords <= 8:
## : : : :...global_rate_negative_words <= 0.02448709: 0 (349/132)
## : : : : global_rate_negative_words > 0.02448709: 1 (57/21)
## : : : num_keywords > 8:
## : : : :...kw_min_avg > 1668.8:
## : : : :...kw_min_max <= 12700: 1 (14/3)
## : : : : kw_min_max > 12700: 0 (2)
## : : : kw_min_avg <= 1668.8:
## : : : :...max_positive_polarity <= 0.75: 0 (104/14)
## : : : max_positive_polarity > 0.75:
## : : : :...avg_negative_polarity <= -0.3720588: 1 (8/1)
## : : : avg_negative_polarity > -0.3720588: 0 (142/45)
## : : self_reference_avg_sharess > 1788.667:
## : : :...kw_avg_avg <= 50.30047:
## : : :...global_rate_positive_words <= -3.226256:
## : : : :...num_imgs <= 0: 1 (46/14)
## : : : : num_imgs > 0: 0 (295/146)
## : : : global_rate_positive_words > -3.226256:
## : : : :...title_sentiment_polarity <= 0.2590909: 0 (331/112)
## : : : title_sentiment_polarity > 0.2590909:
## : : : :...kw_max_avg <= 57.11774: 0 (7)
## : : : kw_max_avg > 57.11774: 1 (66/27)
## : : kw_avg_avg > 50.30047:
## : : :...kw_min_min <= 0: 1 (399/152)
## : : kw_min_min > 0:
## : : :...n_unique_tokens <= 0.5252708:
## : : :...avg_positive_polarity > 0.4522186: 0 (6)
## : : : avg_positive_polarity <= 0.4522186:
## : : : :...num_keywords > 5: 1 (90/21)
## : : : num_keywords <= 5: [S1]
## : : n_unique_tokens > 0.5252708:
## : : :...num_imgs > 1.414214: 0 (10)
## : : num_imgs <= 1.414214: [S2]
## : data_channel_is_tech = 0:
## : :...kw_max_max <= 617900:
## : :...global_subjectivity <= 0.3330598: 0 (155/41)
## : : global_subjectivity > 0.3330598:
## : : :...n_tokens_title <= 1.94591:
## : : :...kw_min_avg <= 383: 1 (122/42)
## : : : kw_min_avg > 383:
## : : : :...num_keywords <= 4:
## : : : :...num_hrefs <= 2.645751: 1 (10)
## : : : : num_hrefs > 2.645751: 0 (4/1)
## : : : num_keywords > 4:
## : : : :...kw_max_min <= 1400: 0 (18)
## : : : kw_max_min > 1400:
## : : : :...num_self_hrefs <= 1: 0 (3)
## : : : num_self_hrefs > 1: 1 (4)
## : : n_tokens_title > 1.94591:
## : : :...num_keywords <= 4: 0 (115/35)
## : : num_keywords > 4:
## : : :...self_reference_avg_sharess > 1991.4:
## : : :...data_channel_is_entertainment = 0: 1 (382/163)
## : : : data_channel_is_entertainment = 1:
## : : : :...n_tokens_title > 2.484907: 1 (10/1)
## : : : n_tokens_title <= 2.484907:
## : : : :...n_tokens_title <= 2.397895: 0 (78/31)
## : : : n_tokens_title > 2.397895:
## : : : :...num_hrefs <= 2.828427: 0 (4)
## : : : num_hrefs > 2.828427: 1 (6)
## : : self_reference_avg_sharess <= 1991.4:
## : : :...num_self_hrefs > 2:
## : : :...self_reference_max_shares <= 31: 1 (12/3)
## : : : self_reference_max_shares > 31:
## : : : :...num_videos <= 1: 0 (60/8)
## : : : num_videos > 1: [S3]
## : : num_self_hrefs <= 2:
## : : :...n_tokens_title > 2.564949:
## : : :...num_self_hrefs <= 1.414214: 1 (18/2)
## : : : num_self_hrefs > 1.414214: [S4]
## : : n_tokens_title <= 2.564949:
## : : :...min_negative_polarity > -1: 0 (646/264)
## : : min_negative_polarity <= -1: [S5]
## : kw_max_max > 617900:
## : :...data_channel_is_lifestyle = 1:
## : :...kw_max_max <= 690400: 1 (72/29)
## : : kw_max_max > 690400: 0 (160/65)
## : data_channel_is_lifestyle = 0:
## : :...self_reference_avg_sharess > 3185.5:
## : :...kw_max_avg > 62.44197:
## : : :...LDA_00 > -0.1674443: 1 (43/9)
## : : : LDA_00 <= -0.1674443:
## : : : :...n_tokens_title > 2.639057: 1 (25/6)
## : : : n_tokens_title <= 2.639057:
## : : : :...num_hrefs <= 3.162278: 0 (373/140)
## : : : num_hrefs > 3.162278: [S6]
## : : kw_max_avg <= 62.44197:
## : : :...LDA_04 <= 0.02857156: 0 (134/23)
## : : LDA_04 > 0.02857156:
## : : :...data_channel_is_entertainment = 0:
## : : :...global_subjectivity <= 0.4697912:
## : : : :...n_unique_tokens <= 0.3952226: 1 (25/7)
## : : : : n_unique_tokens > 0.3952226: 0 (575/186)
## : : : global_subjectivity > 0.4697912:
## : : : :...max_negative_polarity <= -0.375: 0 (11/1)
## : : : max_negative_polarity > -0.375: [S7]
## : : data_channel_is_entertainment = 1:
## : : :...kw_avg_min > 247.25: 1 (54/25)
## : : kw_avg_min <= 247.25:
## : : :...LDA_00 > -3.555154: 0 (152/25)
## : : LDA_00 <= -3.555154:
## : : :...LDA_00 <= -3.555281: 0 (18/6)
## : : LDA_00 > -3.555281: 1 (8)
## : self_reference_avg_sharess <= 3185.5:
## : :...num_self_hrefs > 3.741657:
## : :...max_positive_polarity <= 0.9: 0 (5)
## : : max_positive_polarity > 0.9: 1 (21/6)
## : num_self_hrefs <= 3.741657:
## : :...kw_max_avg <= 60.29007: 0 (3206/630)
## : kw_max_avg > 60.29007:
## : :...data_channel_is_entertainment = 1: 0 (564/128)
## : data_channel_is_entertainment = 0:
## : :...num_hrefs > 5.09902: 1 (45/18)
## : num_hrefs <= 5.09902:
## : :...LDA_02 > -0.3257387: 0 (396/85)
## : LDA_02 <= -0.3257387: [S8]
## kw_avg_avg > 53.68346:
## :...data_channel_is_entertainment = 1:
## :...is_weekend = 1:
## : :...global_subjectivity > 0.3559193:
## : : :...kw_max_avg <= 60.61333: 0 (23/9)
## : : : kw_max_avg > 60.61333: 1 (318/106)
## : : global_subjectivity <= 0.3559193:
## : : :...num_hrefs <= 2.44949: 0 (24/8)
## : : num_hrefs > 2.44949:
## : : :...max_negative_polarity <= -0.075: 0 (10/2)
## : : max_negative_polarity > -0.075: 1 (7/1)
## : is_weekend = 0:
## : :...kw_avg_avg <= 64.43067:
## : :...kw_max_max <= 617900:
## : : :...global_rate_positive_words <= -3.676709: 1 (9)
## : : : global_rate_positive_words > -3.676709:
## : : : :...kw_min_min <= 88: 1 (27/11)
## : : : kw_min_min > 88: 0 (68/27)
## : : kw_max_max > 617900:
## : : :...num_keywords > 7:
## : : :...min_negative_polarity <= -0.875: 1 (118/54)
## : : : min_negative_polarity > -0.875: 0 (431/166)
## : : num_keywords <= 7:
## : : :...self_reference_avg_sharess <= 3960: 0 (755/209)
## : : self_reference_avg_sharess > 3960:
## : : :...LDA_02 > -1.12707: 0 (28/3)
## : : LDA_02 <= -1.12707:
## : : :...num_self_hrefs <= 2.645751: 0 (310/127)
## : : num_self_hrefs > 2.645751: 1 (26/8)
## : kw_avg_avg > 64.43067:
## : :...max_positive_polarity <= 0.55: 0 (72/27)
## : max_positive_polarity > 0.55:
## : :...num_imgs > 1: 1 (203/70)
## : num_imgs <= 1:
## : :...abs_title_subjectivity <= 0.1375:
## : :...avg_positive_polarity <= 0.485: 0 (38/7)
## : : avg_positive_polarity > 0.485: 1 (4)
## : abs_title_subjectivity > 0.1375:
## : :...kw_min_max > 84800:
## : :...title_subjectivity <= 0.5833333: 0 (7/1)
## : : title_subjectivity > 0.5833333: 1 (2)
## : kw_min_max <= 84800:
## : :...n_tokens_title > 2.484907: 1 (32/6)
## : n_tokens_title <= 2.484907:
## : :...n_unique_tokens <= 0.4723404: 1 (18/2)
## : n_unique_tokens > 0.4723404: 0 (81/34)
## data_channel_is_entertainment = 0:
## :...kw_min_max > 690400: 0 (34/7)
## kw_min_max <= 690400:
## :...LDA_02 > -0.8554933:
## :...is_weekend = 1:
## : :...self_reference_avg_sharess <= 1242: 0 (43/18)
## : : self_reference_avg_sharess > 1242:
## : : :...num_self_hrefs <= 3.872983: 1 (100/20)
## : : num_self_hrefs > 3.872983: 0 (5/1)
## : is_weekend = 0:
## : :...LDA_00 > -1.392776: 1 (91/27)
## : LDA_00 <= -1.392776:
## : :...min_positive_polarity <= 0.03333334:
## : :...kw_min_min > 88: 0 (8)
## : : kw_min_min <= 88:
## : : :...self_reference_min_shares > 51.96152:
## : : :...kw_avg_min <= 1788.125: 1 (35/3)
## : : : kw_avg_min > 1788.125: 0 (2)
## : : self_reference_min_shares <= 51.96152:
## : : :...kw_avg_max > 397300: 1 (7)
## : : kw_avg_max <= 397300:
## : : :...kw_min_max <= 38900: 1 (94/45)
## : : kw_min_max > 38900: 0 (7)
## : min_positive_polarity > 0.03333334:
## : :...self_reference_max_shares <= 40: 0 (258/67)
## : self_reference_max_shares > 40:
## : :...num_keywords <= 4:
## : :...min_positive_polarity > 0.05: 0 (21/1)
## : : min_positive_polarity <= 0.05:
## : : :...max_positive_polarity <= 0.75: 0 (3)
## : : max_positive_polarity > 0.75: 1 (4)
## : num_keywords > 4:
## : :...kw_min_avg > 2151.778: 1 (98/35)
## : kw_min_avg <= 2151.778:
## : :...kw_max_max <= 617900:
## : :...kw_max_min <= 531: 0 (3)
## : : kw_max_min > 531: 1 (14/1)
## : kw_max_max > 617900:
## : :...num_keywords <= 9: 0 (227/77)
## : num_keywords > 9: 1 (71/33)
## LDA_02 <= -0.8554933:
## :...is_weekend = 1: 1 (1459/338)
## is_weekend = 0:
## :...n_unique_tokens <= 0.4378172:
## :...data_channel_is_lifestyle = 0:
## : :...min_positive_polarity <= 0.05: 1 (441/67)
## : : min_positive_polarity > 0.05:
## : : :...abs_title_subjectivity > 0.04666667: 1 (298/74)
## : : abs_title_subjectivity <= 0.04666667:
## : : :...data_channel_is_socmed = 0: 0 (40/16)
## : : data_channel_is_socmed = 1: 1 (4)
## : data_channel_is_lifestyle = 1:
## : :...kw_min_avg <= 1116.552: 0 (61/29)
## : kw_min_avg > 1116.552:
## : :...kw_min_max <= 17000: 1 (47/4)
## : kw_min_max > 17000:
## : :...kw_max_max <= 690400: 1 (3)
## : kw_max_max > 690400: 0 (20/8)
## n_unique_tokens > 0.4378172:
## :...self_reference_min_shares <= 45.82576:
## :...kw_min_min > 4: 1 (300/92)
## : kw_min_min <= 4:
## : :...num_self_hrefs > 3: 1 (147/40)
## : num_self_hrefs <= 3:
## : :...data_channel_is_socmed = 1:
## : :...min_positive_polarity <= 0.03333334:
## : : :...n_tokens_content <= 890: 1 (109/18)
## : : : n_tokens_content > 890: 0 (8/1)
## : : min_positive_polarity > 0.03333334:
## : : :...max_positive_polarity > 0.85:
## : : :...kw_min_min <= 0: 0 (46/13)
## : : : kw_min_min > 0: 1 (25/11)
## : : max_positive_polarity <= 0.85:
## : : :...n_tokens_title > 1.791759: [S9]
## : : n_tokens_title <= 1.791759: [S10]
## : data_channel_is_socmed = 0:
## : :...num_keywords <= 3:
## : :...kw_max_max > 690400: 0 (68/16)
## : : kw_max_max <= 690400:
## : : :...num_imgs <= 0: 1 (5)
## : : num_imgs > 0: 0 (18/6)
## : num_keywords > 3:
## : :...num_imgs > 1:
## : :...num_keywords <= 4: 0 (48/18)
## : : num_keywords > 4: [S11]
## : num_imgs <= 1:
## : :...n_tokens_content > 654: 1 (296/113)
## : n_tokens_content <= 654: [S12]
## self_reference_min_shares > 45.82576:
## :...n_tokens_content <= 87:
## :...n_tokens_title <= 1.94591: 0 (15)
## : n_tokens_title > 1.94591: 1 (32/15)
## n_tokens_content > 87:
## :...data_channel_is_socmed = 1:
## :...num_videos > 2:
## : :...num_videos > 7: 1 (12)
## : : num_videos <= 7:
## : : :...n_unique_tokens <= 0.5530547: 1 (9/2)
## : : n_unique_tokens > 0.5530547: 0 (10)
## : num_videos <= 2:
## : :...min_positive_polarity > 0.16: [S13]
## : min_positive_polarity <= 0.16: [S14]
## data_channel_is_socmed = 0:
## :...self_reference_min_shares <= 86.60254:
## :...data_channel_is_tech = 0:
## : :...kw_avg_avg > 69.36768: 1 (394/108)
## : : kw_avg_avg <= 69.36768: [S15]
## : data_channel_is_tech = 1:
## : :...num_videos > 1: 1 (31/2)
## : num_videos <= 1:
## : :...n_non_stop_unique_tokens > -0.1763514: [S16]
## : n_non_stop_unique_tokens <= -0.1763514:
## : :...num_hrefs > 2: 1 (253/59)
## : num_hrefs <= 2:
## : :...num_imgs > 2.828427: 0 (8/2)
## : num_imgs <= 2.828427: [S17]
## self_reference_min_shares > 86.60254:
## :...average_token_length <= 4.220472:
## :...avg_positive_polarity <= 0.4258903: 1 (24/9)
## : avg_positive_polarity > 0.4258903: 0 (14)
## average_token_length > 4.220472: [S18]
##
## SubTree [S1]
##
## self_reference_min_shares <= 64.03124: 0 (7)
## self_reference_min_shares > 64.03124: 1 (3)
##
## SubTree [S2]
##
## self_reference_max_shares > 152.3155: 1 (10)
## self_reference_max_shares <= 152.3155:
## :...kw_max_max > 690400: 0 (92/26)
## kw_max_max <= 690400:
## :...average_token_length <= 4.915152: 1 (42/14)
## average_token_length > 4.915152: 0 (6)
##
## SubTree [S3]
##
## max_negative_polarity <= -0.15: 0 (3)
## max_negative_polarity > -0.15: 1 (5)
##
## SubTree [S4]
##
## min_positive_polarity <= 0.0625: 1 (2)
## min_positive_polarity > 0.0625: 0 (5)
##
## SubTree [S5]
##
## min_positive_polarity > 0.1: 1 (7)
## min_positive_polarity <= 0.1:
## :...num_imgs <= 0: 0 (9/1)
## num_imgs > 0: 1 (56/22)
##
## SubTree [S6]
##
## data_channel_is_entertainment = 0: 1 (124/53)
## data_channel_is_entertainment = 1: 0 (97/42)
##
## SubTree [S7]
##
## abs_title_subjectivity <= 0.475: 0 (85/34)
## abs_title_subjectivity > 0.475: 1 (93/31)
##
## SubTree [S8]
##
## n_non_stop_unique_tokens <= -0.5900176: 1 (40/13)
## n_non_stop_unique_tokens > -0.5900176:
## :...num_videos <= 0: 0 (731/237)
## num_videos > 0:
## :...kw_max_avg <= 67.2759:
## :...num_imgs <= 1.732051: 0 (94/24)
## : num_imgs > 1.732051:
## : :...max_positive_polarity <= 0.7: 0 (2)
## : max_positive_polarity > 0.7: 1 (5)
## kw_max_avg > 67.2759:
## :...avg_positive_polarity <= 0.2941799: 0 (37/10)
## avg_positive_polarity > 0.2941799:
## :...num_self_hrefs <= 1.732051: 1 (87/25)
## num_self_hrefs > 1.732051: 0 (12/3)
##
## SubTree [S9]
##
## global_subjectivity <= 0.6135198: 1 (84/18)
## global_subjectivity > 0.6135198: 0 (13/3)
##
## SubTree [S10]
##
## min_positive_polarity > 0.375: 1 (2)
## min_positive_polarity <= 0.375:
## :...LDA_04 <= 0.04000269: 1 (2)
## LDA_04 > 0.04000269: 0 (11)
##
## SubTree [S11]
##
## self_reference_min_shares > 31.62278: 1 (705/251)
## self_reference_min_shares <= 31.62278:
## :...data_channel_is_lifestyle = 0: 1 (400/181)
## data_channel_is_lifestyle = 1: 0 (41/16)
##
## SubTree [S12]
##
## data_channel_is_tech = 0: 0 (1158/504)
## data_channel_is_tech = 1:
## :...num_keywords > 7:
## :...n_tokens_content <= 208: 0 (44/8)
## : n_tokens_content > 208: 1 (129/61)
## num_keywords <= 7:
## :...n_tokens_title > 2.079442: 1 (133/43)
## n_tokens_title <= 2.079442:
## :...n_tokens_title <= 1.791759: 1 (5)
## n_tokens_title > 1.791759: 0 (39/13)
##
## SubTree [S13]
##
## self_reference_max_shares <= 96.43651: 0 (6)
## self_reference_max_shares > 96.43651: 1 (7)
##
## SubTree [S14]
##
## self_reference_max_shares > 66.3325: 1 (174/11)
## self_reference_max_shares <= 66.3325:
## :...n_tokens_content <= 150: 0 (5)
## n_tokens_content > 150:
## :...global_rate_negative_words <= 0.02584814: 1 (47/6)
## global_rate_negative_words > 0.02584814: 0 (9/3)
##
## SubTree [S15]
##
## max_negative_polarity <= -0.375: 0 (50/19)
## max_negative_polarity > -0.375: 1 (994/390)
##
## SubTree [S16]
##
## global_rate_negative_words <= 0.009920635: 0 (12/1)
## global_rate_negative_words > 0.009920635: 1 (7/2)
##
## SubTree [S17]
##
## avg_negative_polarity <= -0.1464286: 1 (57/9)
## avg_negative_polarity > -0.1464286:
## :...num_keywords <= 6: 0 (13)
## num_keywords > 6: 1 (7/2)
##
## SubTree [S18]
##
## data_channel_is_lifestyle = 0: 1 (920/245)
## data_channel_is_lifestyle = 1:
## :...self_reference_min_shares > 210.4756: 1 (10)
## self_reference_min_shares <= 210.4756:
## :...self_reference_max_shares > 199.2486: 0 (8)
## self_reference_max_shares <= 199.2486:
## :...kw_min_max > 8100: 1 (25/1)
## kw_min_max <= 8100:
## :...n_tokens_title > 2.302585: 1 (23/6)
## n_tokens_title <= 2.302585:
## :...self_reference_max_shares > 179.722: 1 (4)
## self_reference_max_shares <= 179.722:
## :...num_hrefs <= 4.358899: 0 (23/2)
## num_hrefs > 4.358899: 1 (7/2)
##
##
## Evaluation on training data (27046 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 214 8134(30.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 9929 3878 (a): class 0
## 4256 8983 (b): class 1
##
##
## Attribute usage:
##
## 100.00% kw_avg_avg
## 97.78% is_weekend
## 75.75% data_channel_is_socmed
## 63.03% data_channel_is_entertainment
## 61.60% data_channel_is_tech
## 45.40% LDA_02
## 44.65% kw_max_max
## 43.33% self_reference_avg_sharess
## 42.57% n_unique_tokens
## 41.65% kw_min_max
## 37.17% num_self_hrefs
## 35.75% data_channel_is_lifestyle
## 28.77% kw_min_min
## 27.59% num_keywords
## 27.55% kw_max_avg
## 27.54% self_reference_min_shares
## 18.97% n_tokens_content
## 15.37% num_imgs
## 13.01% n_non_stop_unique_tokens
## 11.81% num_hrefs
## 11.10% global_subjectivity
## 10.63% LDA_00
## 10.37% min_positive_polarity
## 10.26% n_tokens_title
## 6.40% num_videos
## 4.68% self_reference_max_shares
## 4.68% min_negative_polarity
## 4.65% max_negative_polarity
## 4.32% LDA_04
## 4.09% average_token_length
## 3.60% kw_min_avg
## 3.45% max_positive_polarity
## 3.14% global_rate_positive_words
## 2.60% abs_title_subjectivity
## 2.02% global_rate_negative_words
## 1.71% title_sentiment_polarity
## 1.58% avg_negative_polarity
## 1.19% avg_positive_polarity
## 0.99% kw_avg_min
## 0.65% abs_title_sentiment_polarity
## 0.45% rate_positive_words
## 0.40% kw_avg_max
## 0.40% global_sentiment_polarity
## 0.16% kw_max_min
## 0.13% LDA_01
## 0.03% title_subjectivity
##
##
## Time: 1.8 secs
Predict on test set
news.c50.pred <- predict(news.c50, news[ind==2,], type="class")
news.c50.prob <- predict(news.c50, news[ind==2,], type="prob")
Confusion matrix for test set
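The chunk producing the confusion matrix is hidden; presumably a `caret` call along these lines, comparing the class predictions against the true test labels (the positive class defaults to `0`, as in the output):

```r
library(caret)

# Confusion matrix and derived statistics for the C5.0 test-set predictions
confusionMatrix(news.c50.pred, news[ind == 2, ]$shares)
```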
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3773 2160
## 1 1953 3495
##
## Accuracy : 0.6386
## 95% CI : (0.6297, 0.6474)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.277
## Mcnemar's Test P-Value : 0.001318
##
## Sensitivity : 0.6589
## Specificity : 0.6180
## Pos Pred Value : 0.6359
## Neg Pred Value : 0.6415
## Prevalence : 0.5031
## Detection Rate : 0.3315
## Detection Prevalence : 0.5213
## Balanced Accuracy : 0.6385
##
## 'Positive' Class : 0
##
ROC Curve
news.c50.roc <- roc(news[ind==2,]$shares,news.c50.prob[,2])
plot(news.c50.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
     grid.col=c("green", "red"), max.auc.polygon=TRUE,
     auc.polygon.col=color.c50, print.thres=TRUE)
Random Forest
Train the model on train set
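Again the training chunk is hidden; a minimal sketch assuming the `randomForest` package with default settings, which is consistent with the model structure printed below (an `err.rate` matrix of 500 trees, `mtry` near sqrt(49)):

```r
library(randomForest)

# Fit a random forest classifier on the training partition;
# defaults are ntree = 500 and mtry = floor(sqrt(p)) for classification
news.rf <- randomForest(shares ~ ., data = news[ind == 1, ])

# List the components of the fitted object
summary(news.rf)
```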
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 27046 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 54092 matrix numeric
## oob.times 27046 -none- numeric
## classes 2 -none- character
## importance 49 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 27046 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
Plot number of trees vs error
Plot feature importance
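The two plotting chunks are likewise not echoed; the standard `randomForest` calls for these figures would be:

```r
# OOB and per-class error as a function of the number of trees grown
plot(news.rf, main = "Random Forest: error vs. number of trees")

# Predictor importance as mean decrease in Gini impurity
varImpPlot(news.rf, main = "Random Forest: variable importance")
```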
Predict on test set
news.rf.pred <- predict(news.rf, news[ind==2,], type="class")
news.rf.prob <- predict(news.rf, news[ind==2,], type="prob")
Confusion matrix for test set
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3845 1932
## 1 1881 3723
##
## Accuracy : 0.665
## 95% CI : (0.6562, 0.6736)
## No Information Rate : 0.5031
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.3299
## Mcnemar's Test P-Value : 0.4181
##
## Sensitivity : 0.6715
## Specificity : 0.6584
## Pos Pred Value : 0.6656
## Neg Pred Value : 0.6643
## Prevalence : 0.5031
## Detection Rate : 0.3378
## Detection Prevalence : 0.5076
## Balanced Accuracy : 0.6649
##
## 'Positive' Class : 0
##
ROC Curve
news.rf.roc <- roc(news[ind==2,]$shares,news.rf.prob[,2])
plot(news.rf.roc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
     grid.col=c("green", "red"), max.auc.polygon=TRUE,
     auc.polygon.col=color.rf, print.thres=TRUE)
Model Comparison
ROCCurve<-par(pty = "s")
plot(performance(prediction(news.knn.prob[,2],news[ind==2,]$shares),'tpr','fpr'),
col=color.knn, lwd=3)
text(0.55,0.6,"KNN",col=color.knn)
plot(performance(prediction(news.cart.prob[,2],news[ind==2,]$shares),'tpr','fpr'),
col=color.cart, lwd=3, add=TRUE)
text(0.3,0.4,"CART",col=color.cart)
plot(performance(prediction(news.c50.prob[,2],news[ind==2,]$shares),'tpr','fpr'),
col=color.c50, lwd=3, add=TRUE)
text(0.15,0.5,"C5.0",col=color.c50)
plot(performance(prediction(news.rf.prob[,2],news[ind==2,]$shares),'tpr','fpr'),
col=color.rf, lwd=3, add=TRUE)
text(0.3,0.7,"Random Forest",col=color.rf)
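Beyond the overlaid ROC curves, the models can be compared numerically; a sketch using `pROC::auc` on the ROC objects built above, assuming a KNN ROC object is constructed the same way from `news.knn.prob`:

```r
# Hypothetical: build the KNN ROC object analogously to the other models
news.knn.roc <- roc(news[ind == 2, ]$shares, news.knn.prob[, 2])

# Tabulate test-set AUC for each classifier
data.frame(model = c("KNN", "CART", "C5.0", "Random Forest"),
           AUC   = c(auc(news.knn.roc), auc(news.cart.roc),
                     auc(news.c50.roc), auc(news.rf.roc)))
```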